ON WORD BOUNDARIES IN JAPANESE 


John J. Chew, Jr. 
State University of New York at Buffalo 


It was often my experience, while working with the Foreign Service 
Institute's intensive Japanese language program, that native instructors, 
when called upon to prepare romanized texts, omitted word boundaries, or 
inserted them where they didn't belong. Not infrequently would I encounter
examples such as: sore ga wakari másuka? 'Do you understand
that?' for sore ga wakarimásu ka?, abunái desuyo! 'It's dangerous.' for
abunái desu yo!, or tanaka sanga byooki désukara 'because Mr. Tanaka is
sick' for tanaka-san ga byooki désu kara. This seemed surprising to me
at first, but others have told me of having the same experience.


The phenomenon seems less odd when we consider that the spaces are 
not only not spoken, but, in the native orthography, not even written. 
How then is the Japanese to decide where the word boundaries do indeed 
belong? 


The magnitude of this task begins to become apparent when we con- 
sider that we, who have seen our words neatly spaced since the first 
grade, not infrequently have trouble deciding whether to write a given 
sequence as one or two words, or with a hyphen. 


The task of positing word boundaries confronted the first occidentals
who attempted to produce romanized texts in Japanese. These pioneers,
without attempting a definition of the word, simply wrote what they
assumed to be words.


The first serious attempt by a linguist to define the word in Japan- 
ese was made by Bernard Bloch. Bloch observed that Japanese consists 
of sequences that may be bounded by pauses. These he labeled pause- 
groups. A pause-group which does not contain within it any fraction 
which itself occurs elsewhere as a pause-group is a minimal pause-group. 
For example, the sequence /anomatíkara/ 'from that town' contains the 
sequence /mati/ 'town', which itself is a minimal pause-group. Holding 
that all minimal pause-groups are words, Bloch observed that they may 
have one or more high-pitched syllables, but that when they have several 
high-pitched syllables, these occur only contiguously - never being sepa- 
rated by a low-pitched sequence. Considering the accent patterns of the 
minimal pause-groups as models for the accent of the word, Bloch then 
maintained that a pause-group containing high-pitched sequences separated 
by a low-pitched sequence must contain at least two words. Thus the se- 
quence /dónotatémono/ 'which building' must contain at least two words. 
From this position Bloch proceeded to analyze pause-groups containing 
several words, but having the accent patterns of single words, by com- 
paring their grammatical structure with that of sequences where the ac- 
cent clearly shows the presence of more than one word. Thus by analogy, 
the sequence /sonoagemono/ 'that fried food', although it has no accent,
and thus resembles some minimal pause-groups, must contain the same number
of words as the sequence /dónotatémono/, because it has the same
number of morphs and the same relationships between them.
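
Bloch's two tests, as summarized above, can be sketched roughly in code.
This is only an illustration under stated assumptions: the function names
are hypothetical, a 'fraction' is taken to be any proper substring of an
accent-free phonemic string, and the pitch strings of H (high) and L (low)
are schematic patterns rather than transcriptions of the forms cited.

```python
def is_minimal_pause_group(form: str, attested: set[str]) -> bool:
    """True if no proper substring of `form` is itself an attested pause-group."""
    n = len(form)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = form[start:end]
            if piece != form and piece in attested:
                return False
    return True


def min_words_by_accent(pitch: str) -> int:
    """Lower bound on the word count of a pause-group.

    `pitch` is a string of 'H' and 'L', one letter per syllable.
    Each maximal run of 'H' counts once, so two high-pitched runs
    separated by a low-pitched stretch imply at least two words.
    A group with one run, or none, yields 1.
    """
    runs = 0
    previous = "L"
    for level in pitch:
        if level == "H" and previous != "H":
            runs += 1
        previous = level
    return max(1, runs)


# Illustrative data only: two pause-groups taken from the text, accents omitted.
attested = {"anomatikara", "mati"}
print(is_minimal_pause_group("mati", attested))         # True
print(is_minimal_pause_group("anomatikara", attested))  # False

# Schematic pitch patterns (assumed, not measured).
print(min_words_by_accent("HLLHLL"))  # 2
print(min_words_by_accent("LHHHLL"))  # 1
```

The accent test gives only a lower bound, which is why Bloch had to
appeal to grammatical analogy for groups such as /sonoagemono/.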


Bloch now went one step farther and adopted the 'guiding principle'
that no form has as its immediate constituents a phrase of two or more
words on the one hand, and a bound form (i.e. a constituent of a word)
on the other. This means, for example, that /e/ 'to' in the sequence
/anotatémonoe/ 'to that building' must be a word, even though it never
occurs as a pause-group, because the immediate constituents of the
expression are /anotatémono/ 'that building' - a pause-group containing
two words - and /e/ 'to'. By this means Bloch was able to account for
the particles having the status of words.


Bloch was now faced with deciding the status of /rasii/ 'apparently'. 
This he considered to be a bound form on the basis of accent, in spite of 
its immediate constituent relationships. 


Now these are subtleties with which the average Japanese can scarcely
be expected to cope. Eleanor Jorden, in her dissertation, in pointing
out deficiencies in Bloch's definition of the word in Japanese, attempted
to define the word, or rather the lexeme, more rigorously in
terms of immediate constituents - yet the lexeme boundaries she arrived 
at are virtually the same as Bloch's word boundaries, which in turn are 
virtually the same boundaries that Americans and Englishmen have been 
using for the past hundred years - the very same boundaries indeed, that 
native Japanese find confusing. 


In my own dissertation I took the position that it might be possible 
to describe Japanese adequately without reference to lexemes (i.e. to 
syntactically determined words), while limiting my treatment of the word 
to those cases where junctures and accent patterns had to be accounted 
for. I confined myself to speculating that phonologically determined 
words would be considerably longer and morphologically much more complex 
than the lexemes that Jorden had proposed. 


Since that time something has occurred which has attracted me to 
the concept of longer words. A body of Japanese literature has come to 
my attention, which, as far as I know, has never seriously been con- 
sidered by linguists, but which throws a bright light on the question 
of the word as conceived by the Japanese themselves. I refer to that 
body of literature which is directed to children - ages six to nine. 

The style of this literature is quite close to the colloquial. It is
written phonetically, that is to say, in straight hiragana syllabary, and,
most significant of all, word boundaries are provided.


But the word boundaries of these hiragana texts are quite different
from those of romanized texts, as the following example will show. The
major difference is that the 'words' that Bloch and Jorden call the
particles and the forms of the copula appear as suffixes, i.e. as bound
forms.


In romanizing the following hiragana text, I have used the system 
developed by Bloch and Jorden for use in their book Spoken Japanese, with 
the single modification that the minuscule tu which marks consonantal
length has been represented by a q. But I have applied the system to 
each hiragana symbol separately, adhering rigidly to the original spell- 
ing, punctuation, and (of course) spacing. 
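
The procedure just described can be sketched as a symbol-by-symbol
lookup. This is a minimal illustration, not the procedure actually used:
the table KANA is deliberately partial, the function name romanize is
hypothetical, and treating a kana followed by small yo as a single unit is
an assumption suggested by spellings such as tyoqto in the text.

```python
# Partial, illustrative kana table (assumption: only the symbols needed here).
KANA = {
    "て": "te", "ん": "n", "ぐ": "gu", "の": "no", "う": "u", "ち": "ti",
    "わ": "wa", "や": "ya", "と": "to", "は": "ha", "ほ": "ho", "を": "wo",
    "っ": "q",      # small tu marking consonantal length, written q
    "ちょ": "tyo",  # kana plus small yo taken as one unit (assumption)
}


def romanize(text: str) -> str:
    """Romanize kana one symbol at a time, preserving the original spacing."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in KANA:
            out.append(KANA[pair])
            i += 2
        else:
            # Symbols missing from the table (spaces, punctuation) pass through.
            out.append(KANA.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(romanize("てんぐの うちわ"))  # tenguno utiwa
print(romanize("やって"))          # yaqte
print(romanize("ちょっと"))        # tyoqto
```

Because each symbol is converted as spelled, the particle written with
the kana for ha comes out as ha and a long o spelled with u comes out as
ou, as in the text that follows.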


The text is as follows: 


TENGUNO UTIWA 


mukasi, aru murano otokono koga, saikorowo huqte asonde imasita.
suruto, tenguga yaqte kite, "hou, omosirosouna monodana. tyoqto kasite
kurenaika." "kasunoha iyadayo." "deha, kono utiwato torikaeqko siyou."
"kono utiwade aoguto, hitono hanaga nobite takaku naru. uragaesite
aoguto, hikuku naruyo." to, tenguha iimasita.


otokono koha, saikoroto utiwawo torikaemasita. "yosi, korewo 
tamesite miyou." 


aruite ikuto, mukoukara kagoga yaqte kimasita. kagono nakaniha, 
musumesanga noqte imasu. okanemotino musumesande, omiyamairino 
kaeridesita. 


otokono koga, utiwade aoide, "hanayo, takaku nare." to, omazinaiwo
iuto, musumesanno hanaga, mirumiru takaku narimasita. minna
biqkurisimasita. hanano nobita musumesanha, "aa, dousitara yoikasira." to
kanasinde, byoukini naqte simaimasita. otousanya okaasanmo sinpaisite,
oisyasamani mite moraqtari, iroirona kusuriwo nomasetari simasita. demo,
musumesanno hanaha, mizikaku narimasen.


sokode, musumesanno iedeha, monno maeni tatehudawo tatemasita. 
soreniha, "musumeno hanawo naosite kureta monowo, musumeno mukoni suru."
to kaite arimasu. minnaga sorewo yomimasita. tenguno utiwawo moqta 
otokono komo yomimasita. 


otokono koha, okanemotino iehe iqte, nete iru musumesanno
makuramotoni suwarimasita.


tenguno utiwawo uragaesite aoginagara, "hanayo, hikuku naare."
to iuto, hanaha, motonoyouni narimasita. musumesanno oyatatiha
ooyorokobidesu.


otokono koha, okanemotino ieno omukosanni narimasita. 
aru hi, otokono koha, niwano suzumidaino ueni nekoronde, tenguno 
utiwade zibunwo aoginagara iimasita. "hanayo, takaku naare." hanaha,
gungun nobite ikimasu. "yaa, omosiroizo."


hanaha, toutou, sorano kumomade todokimasita. kaminariga sorewo 
mitukete, osaemasita. otokono koha biqkurisite, "hanayo, mizikaku naare."
to iqte, sorawo mimasita.


demo, kaminariga hanawo osaete irunode, otokono koha sorahe 
hiqpariagerarete, sono mama kaeqte kimasendesita. 


For comparison I append the text as it would appear in romanized 
form (following the Bloch-Jorden system). 


Mukasi, áru mura no otokó no ko ga, saikóro o hutte asonde imásita.
Suru to, tengu ga yatte kite, "Hóo, omosirosóo na mono da na. Tyótto
kasite kurenái ka?" "Kasu no wa iyá da yo!" "Dé wa, kono utiwa to
torikaekko siyoo! Kono utiwa de aógu to, hito no hana ga nóbite tákaku
naru. Uragáesite aógu to, hikuku naru yo." to, tengu wa iimásita.


Otokó no ko wa, saikóro to utiwa o torikaemásita. "Yósi, kore o
tamésite miyoo." 


Arúite iku to, mukoo kara kago ga yatte kimásita. Kago no náka ni 
wa, musume-san ga notte imasu. Okanemoti no musume-san de, omiyamáiri 
no kaerí desita. 


Otokó no ko ga, utíwa de aóide, "Hana yo, tákaku nare." to, omazinai
o iu to, musumesan no hana ga, mírumiru tákaku narimasita. Minna
bikkúri-simasita. Hana no nóbita musumesan wa, "Áa, dóo sitara yói
kasira." to kanasínde, byooki ni nátte simaimasita. Otóosan ya okáasan mo
sinpai-site, oisyasama ni mite morattari, iroiro na kusuri o nomásetari
simasita. Dé mo, musumesan no hana wa, mizikáku narimasén.


Soko de, musumesan no ié de wa, món no mae ni tatéhuda o tatemásita. 
Sore ni wa, "Musume no hana o naósite kureta monó o, musume no múko ni
suru." to káite arimasu. Minná ga sore o yomimásita. Tengu no utíwa o
mótta otokó no ko mo yomimásita. 


Otokó no ko wa, okanemoti no ié e itte, nete iru musumesan no 
makurámoto ni suwarimásita. 


Tengu no utiwa o uragáesite aoginágara, "Hana yo, híkuku náare."
to iu to, hana wa, móto no yoo ni narimásita. Musumesan no oyátati wa
ooyórokobi desu.


Otokó no ko wa, okanemoti no ie no omúkosan ni narimasita. 



Aru hi, otokó no ko wa, niwa no suzumidaí no ué ni nekorónde, 
tengu no utiwa de zibuñ o aoginágara iimásita. "Hana yo, tákaku náare."
Hana wa, gúngun nóbite ikimasu. "Yáa, omosirói zo."


Hana wa, tóotoo, sóra no kúmo made todokimásita. Kaminári ga sore
o mitukete, osaemásita. Otokó no ko wa bikkúri-site, "Hana yo, mizikáku
náare." to itte, sóra o mimásita.


Dé mo, kaminári ga hana o osáete iru no de, otokó no ko wa sóra e
hippariagerárete, sono mama káette kimasen desita. 
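

The relation between the two spacings can be sketched for one sentence
that appears in both versions of the text. The sketch is illustrative
only: the set PARTICLES lists just the particles of this one sentence, and
the function names are hypothetical. It attaches each particle of the
Bloch-Jorden version to the preceding word and compares the resulting word
count with that of the hiragana spacing.

```python
import unicodedata

# Assumption: only the particles occurring in the sample sentence below.
PARTICLES = {"no", "wa", "to", "o"}


def plain_tokens(sentence: str) -> list[str]:
    """Lowercase, drop accent marks and punctuation, and split on spaces."""
    decomposed = unicodedata.normalize("NFD", sentence.lower())
    kept = "".join(c for c in decomposed if c.isalpha() or c.isspace())
    return kept.split()


def attach_particles(tokens: list[str]) -> list[str]:
    """Join every particle to the word before it, as in the hiragana spacing."""
    merged: list[str] = []
    for token in tokens:
        if token in PARTICLES and merged:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


bloch_jorden = "Otokó no ko wa, saikóro to utiwa o torikaemásita."
hiragana_style = "otokono koha, saikoroto utiwawo torikaemasita."

separate = plain_tokens(bloch_jorden)
merged = attach_particles(separate)

print(len(separate))                       # 9
print(merged)
print(len(merged))                         # 5
print(len(plain_tokens(hiragana_style)))   # 5
```

The merged forms differ from the hiragana version only in spelling (kowa
for koha, utiwao for utiwawo), since the latter follows the kana
orthography; the word counts agree.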




The materials which I have been able to examine since this type of 
literature came to my attention, and of which the above sample is rep- 
resentative, number about 300 pages. They are from three different 
publishing houses, including one which prints school textbooks. In the 
case of the textbooks, the hiragana is gradually replaced by the Chinese 
characters which the children are supposed to have learned; and it is 
interesting to note that, although this process continues throughout 
the child's education, the spaces between the words are discarded in 
the second semester of the third grade. 


There is some fluctuation in the word boundaries as they are printed 
by the different publishing houses, but it is significant that in no case 
are the particles and copular forms printed as separate words. Aside 
from this, the following points are of minor interest. 


1. The quotative particle /to/ is printed as a separate word, but 
when /to/ means 'if; when' it is printed as a suffix. 


2. The auxiliary noun stems /no/ 'fact' and /yoo/ 'like' are always 
printed as suffixes. Other auxiliary noun stems (e.g. /hoo/ ‘alternative’, 
/koto/ 'action', /toki/ 'time') are usually printed as separate words, 
although they are to no less an extent bound forms, and, like /no/ and 
/yoo/, always belong to the same pause group as the preceding stem. 


The auxiliary noun stem /mono/ 'usually' is printed as a separate 
word, but when the form /mono/ means 'because', it is printed as a suffix, 
e.g. siranaindesumono ‘because I don't know'. 


One of the publishing houses is more inclined than the others to
print auxiliary stems as suffixes, e.g. siqteita 'knew', moqtekite 'bring',
kaeqteikimasita 'went home'; though it is inconsistent in this matter.
Where a contracted form is used, all three firms agree in printing the
contraction as a single word, e.g. ibaqterunaa 'they're stuck up, aren't
they'.


3. Forms of the verb /suru/ 'to do' are usually printed as suffixes
to a noun stem which has no other suffixes, though there is some
inconsistency (cf. biqkurisimasita 'was surprised' and sinpaisite 'being
worried' with torikaeqko siyou 'let's trade' and dou sitara 'what should
I do', all from the above text. In other texts I have encountered
dousitemo 'by all means' and kousite 'doing thus').


Whenever the preceding noun stem has a suffix, the following form 
of /suru/ is printed as a separate word. In this connection ziqto
simasu 'stands still' occurs as two words, as if the -to of ziqto were
felt to be the adverbial suffix -to.


4. Certain combinations with /no/ 'of' are printed as single words, 
e.g. matunoki 'pine tree', yumenokuni 'dreamland', though not consistent- 
ly. One might expect to find otokono ko 'boy' printed as one word, but 
such is not the case. 




5. /kono/ 'this', /sono/ 'that', etc. are always printed as sepa- 
rate words, the /-no/ possibly being equated with the common noun suffix 
(or particle) /-no/. 


The punctuation in this literature is interesting in that the commas 
correspond closely to the points where single-bar juncture would occur in 
speech. In fact, with regard to the question of juncture, the relation- 
ship between the word boundaries in this literature and the occurrence of 
plus-juncture in speech is much closer than is the case with our romanized
texts.


To assess the reaction of native Japanese to these texts, I feigned 
naiveté and asked what the spaces were for. Repeatedly I was informed 
that it would be difficult for a child to understand a text without 
spaces. Usually I was told that the spaces provided points where the 
child could stop without making the text unintelligible. When I inquired 
further if it wouldn't be an improvement to provide spaces before the 
particles and copula, the reply was invariably negative: "Breaking up 
the 'words' would confuse the children." 


If the Japanese regard the copula and particles as suffixes, what is 
the justification for treating them as separate words? It is admittedly 
not a phonological matter. One is tempted to suppose that speakers of 
Western European languages, having found a similarity of function between 
the particles of Japanese and the prepositions of their own languages, 
and between the Japanese copula and the copular uses of the European verb 
'to be', have reasoned that what are words in their own languages, should 
be words in Japanese. Had the first Westerners to romanize Japanese 
spoken languages similar in structure to Turkish, is it likely that the 
word boundaries they might have supplied would have treated copula and 
particles as separate words? 


Even though Westerners naturally tend to regard Japanese nouns as 
uninflected, lacking as they do the familiar inflectional categories 
of number and gender, they would not have treated them thus had it not 
been for the low degree of fusion between noun stems and their suffixes. 
I say low degree because it is not true to say that there is no fusion. 
Aside from the alternation /náni/ - /nán/ 'what', where /nán/ occurs 
only before certain suffixes, there may not be many significant instances 
of segmental fusion, but the fusion at the supra-segmental level is ex- 
tensive, the prediction of the accent of the combination noun-stem plus 
suffix being a complex matter, cf. /íma-kara/ 'from now on' with
/ima-máde/ 'up till now', or /kinóo-kara/ 'since yesterday' with /kinoo-no
gógo/ 'yesterday afternoon'. This particular type of fusion tends to
be overlooked by Westerners since it is not directly represented in the 
native orthography (and seldom observed by the Westerner in speaking). 


When it comes to determining word boundaries through syntax, the 
concept of immediate constituents is at best difficult to apply. If we 
are to regard /e/ 'to' as one of the immediate constituents of
/anotatémono-e/ 'to that building', and then to call it a word because /ano
tatémono/ is two words, what is to prevent us from deciding that /ta/
'past tense' is an immediate constituent of /takusantábe-ta/ 'ate a
lot' (one immediate constituent being the tense and the other being the
action 'eating a lot'), and then calling /ta/ a word because /takusan
tábe/ is two words? Of course we don't do this. To begin with, there
is considerable fusion in Japanese between verb stems and tense suffixes. 
But quite apart from this, we subconsciously treat tense morphs as af- 
fixes. We do it even in the case of the Japanese adjectives, where the 
fusion is little greater than for the noun. 


As a result of such analysis, Japanese appears to have a rather 
lopsided structure. The major word classes of verb and adjective are 
inflected, but the other major word class consists of the uninflected 
nouns. In addition there is a class consisting of one inflected word - 
the copula. 


Have we not, in effect, taken a language which has three major clas- 
ses of inflected words - nouns, verbs, and adjectives - and stripped one 
class of its inflection, and then called that inflection the copula and 
the particles? If so, then it follows that when we ask unsophisticated 
Japanese to do the same, they may not always be sure when to write 
'suffixes' as independent 'words'. 


Notes 


1 Bernard Bloch, Studies in Colloquial Japanese II: Syntax, Language, 
1946; reprinted in Readings in Linguistics, American Council of 
Learned Societies, 1957. 


2 Eleanor Jorden, The Syntax of Modern Colloquial Japanese, Language, 
1955, Language Dissertation No. 52. 


3 This is not to say that the Japanese can't learn them, but merely 
that the thing isn't as automatic for them as it is for us. 


